Based on the feedback from our first submission, this entire section was redone from scratch. We created both an LR model and an SVM model to classify our data on the ('ON_TIME_ARRIVAL') variable that we create below, using our 80% classification accuracy threshold as our measure of classifier success.
LR model changes: The data went through the same basic manipulation; variables that were of no use, or that could be derived from other variables, were removed. This time we retained the categorical variables and one-hot encoded them, and all variables not already on a 0-1 scale were rescaled. We also had a roughly 2/3 to 1/3 outcome class imbalance, which we corrected by resampling with SMOTE. We then classified on the scaled, one-hot encoded, and resampled data using a fixed random seed on the CV object. We achieved 89% accuracy on the first attempt; after adjusting the cost parameter we reached around 93% accuracy. Unlike our previous submission, our CV worked and gave reliable, usable output.
SVM model changes: We used the same data as manipulated for the LR portion of this assignment. However, we did not pre-scale the data up front and did not resample with SMOTE; instead, scaling was applied within each cross-validation fold. We still retained the categorical variables and one-hot encoded them. We used 5-fold CV on the (fold-scaled) data to classify with the SVM. We created a widget to adjust the SVM parameters, including cost, gamma, and the kernel, and tested various iterations with linear, rbf, and poly kernels, varying the cost and gamma values to determine the best fit for our data.
This section was completely reworked. We included a table with results from each modeling run and a discussion to answer the questions: "Does one type of model offer superior performance over another in terms of prediction accuracy? In terms of training time or efficiency?" We explained our results in detail, along with our thoughts and analysis of both the LR and SVM models. We also shortened our lengthy introduction.
Since the model was completely overhauled, and our original interpretations of the features were incorrect, this section was completely redone. We interpreted each feature and its relationship with the response.
Since the model was completely overhauled, and we did not review all of the features we selected, this section was completely redone. We reviewed 7 features instead of our original three, and captured each of the main categories of feature.
First, we import the necessary packages and the data set. We are continuing our use of the DOT airline statistics and investigating techniques to better help travelers determine how to avoid delays when flying. We have broken the project into the required subsections:
-Create Models
-Cleaning up the data
-Code Reuse
-Train/Test Split
-Logistic Regression
-Support Vector Machine
-Model Advantages
-Interpret Feature Importance
-Interpret Support Vectors
#import libraries
from __future__ import print_function
import pandas as pd
import numpy as np
#Read in the flight delay data
#Our data is from Department of transportation
#https://www.bts.gov/topics/airlines-and-airports/number-14-time-reporting
df = pd.read_csv('2018.csv') # read in the csv file
#df.info()
#Reduce dataset to a more manageable size.
#We randomly sampled 150,000 records. This is a reasonable sample size that will not cause excessively long computation times
dfReduced=df.sample(n=150000, random_state=1)
#View some basic information on dataset
dfReduced.info()
#View some summary statistics on variables
dfReduced.describe().apply(lambda s: s.apply('{0:.5f}'.format))
<class 'pandas.core.frame.DataFrame'> Int64Index: 150000 entries, 695058 to 499540 Data columns (total 28 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 FL_DATE 150000 non-null object 1 OP_CARRIER 150000 non-null object 2 OP_CARRIER_FL_NUM 150000 non-null int64 3 ORIGIN 150000 non-null object 4 DEST 150000 non-null object 5 CRS_DEP_TIME 150000 non-null int64 6 DEP_TIME 147634 non-null float64 7 DEP_DELAY 147518 non-null float64 8 TAXI_OUT 147561 non-null float64 9 WHEELS_OFF 147561 non-null float64 10 WHEELS_ON 147492 non-null float64 11 TAXI_IN 147492 non-null float64 12 CRS_ARR_TIME 150000 non-null int64 13 ARR_TIME 147492 non-null float64 14 ARR_DELAY 147112 non-null float64 15 CANCELLED 150000 non-null float64 16 CANCELLATION_CODE 2446 non-null object 17 DIVERTED 150000 non-null float64 18 CRS_ELAPSED_TIME 150000 non-null float64 19 ACTUAL_ELAPSED_TIME 147165 non-null float64 20 AIR_TIME 147165 non-null float64 21 DISTANCE 150000 non-null float64 22 CARRIER_DELAY 28035 non-null float64 23 WEATHER_DELAY 28035 non-null float64 24 NAS_DELAY 28035 non-null float64 25 SECURITY_DELAY 28035 non-null float64 26 LATE_AIRCRAFT_DELAY 28035 non-null float64 27 Unnamed: 27 0 non-null float64 dtypes: float64(20), int64(3), object(5) memory usage: 33.2+ MB
| OP_CARRIER_FL_NUM | CRS_DEP_TIME | DEP_TIME | DEP_DELAY | TAXI_OUT | WHEELS_OFF | WHEELS_ON | TAXI_IN | CRS_ARR_TIME | ARR_TIME | ... | CRS_ELAPSED_TIME | ACTUAL_ELAPSED_TIME | AIR_TIME | DISTANCE | CARRIER_DELAY | WEATHER_DELAY | NAS_DELAY | SECURITY_DELAY | LATE_AIRCRAFT_DELAY | Unnamed: 27 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 150000.00000 | 150000.00000 | 147634.00000 | 147518.00000 | 147561.00000 | 147561.00000 | 147492.00000 | 147492.00000 | 150000.00000 | 147492.00000 | ... | 150000.00000 | 147165.00000 | 147165.00000 | 150000.00000 | 28035.00000 | 28035.00000 | 28035.00000 | 28035.00000 | 28035.00000 | 0.00000 |
| mean | 2610.45619 | 1328.22438 | 1332.47427 | 9.90160 | 17.40465 | 1356.01141 | 1461.80967 | 7.62469 | 1484.53053 | 1466.05086 | ... | 141.13015 | 136.46714 | 111.45453 | 799.62115 | 19.25297 | 3.53080 | 15.95798 | 0.08935 | 25.82215 | nan |
| std | 1864.14742 | 491.01257 | 504.34464 | 45.12021 | 9.95844 | 505.93189 | 532.98471 | 6.11827 | 518.40491 | 537.39677 | ... | 73.50667 | 73.21552 | 71.16854 | 598.75699 | 60.43911 | 27.28905 | 36.98216 | 3.13707 | 51.35604 | nan |
| min | 1.00000 | 1.00000 | 1.00000 | -43.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | 1.00000 | ... | 21.00000 | 18.00000 | 8.00000 | 31.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | nan |
| 25% | 1026.00000 | 913.00000 | 915.00000 | -5.00000 | 11.00000 | 930.00000 | 1044.00000 | 4.00000 | 1059.00000 | 1048.00000 | ... | 88.00000 | 83.00000 | 60.00000 | 361.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | nan |
| 50% | 2131.00000 | 1320.00000 | 1326.00000 | -2.00000 | 15.00000 | 1339.00000 | 1502.00000 | 6.00000 | 1515.00000 | 1505.00000 | ... | 122.00000 | 118.00000 | 92.00000 | 632.00000 | 0.00000 | 0.00000 | 3.00000 | 0.00000 | 3.00000 | nan |
| 75% | 4079.00000 | 1735.00000 | 1742.00000 | 7.00000 | 20.00000 | 1757.00000 | 1910.00000 | 9.00000 | 1917.00000 | 1915.00000 | ... | 171.00000 | 167.00000 | 141.00000 | 1031.00000 | 17.00000 | 0.00000 | 20.00000 | 0.00000 | 31.00000 | nan |
| max | 7439.00000 | 2359.00000 | 2400.00000 | 1603.00000 | 171.00000 | 2400.00000 | 2400.00000 | 170.00000 | 2400.00000 | 2400.00000 | ... | 695.00000 | 690.00000 | 665.00000 | 4983.00000 | 1594.00000 | 976.00000 | 1205.00000 | 398.00000 | 1379.00000 | nan |
8 rows × 23 columns
Next, we create an array of each type of delay. This will be helpful as we will need to handle the missing values for the delays.
#Remove attributes that just aren't useful for us
#Each variable removed is either of no use or can be derived from other variables in the dataset.
del dfReduced['DIVERTED']
del dfReduced['DISTANCE']
del dfReduced['TAXI_OUT']
del dfReduced['TAXI_IN']
del dfReduced['Unnamed: 27']
del dfReduced['CANCELLED']
del dfReduced['CANCELLATION_CODE']
del dfReduced['DEST']
del dfReduced['OP_CARRIER_FL_NUM']
#Bring all delay types into one variable "delayArr"
# df.info()
delayArr = [
'DEP_DELAY'
,'ARR_DELAY'
,'CARRIER_DELAY'
,'WEATHER_DELAY'
,'NAS_DELAY'
,'SECURITY_DELAY'
,'LATE_AIRCRAFT_DELAY'
]
dfReduced[delayArr].describe().apply(lambda s: s.apply('{0:.5f}'.format)) # summary statistics in non-scientific notation
| DEP_DELAY | ARR_DELAY | CARRIER_DELAY | WEATHER_DELAY | NAS_DELAY | SECURITY_DELAY | LATE_AIRCRAFT_DELAY | |
|---|---|---|---|---|---|---|---|
| count | 147518.00000 | 147112.00000 | 28035.00000 | 28035.00000 | 28035.00000 | 28035.00000 | 28035.00000 |
| mean | 9.90160 | 4.97184 | 19.25297 | 3.53080 | 15.95798 | 0.08935 | 25.82215 |
| std | 45.12021 | 47.21904 | 60.43911 | 27.28905 | 36.98216 | 3.13707 | 51.35604 |
| min | -43.00000 | -77.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| 25% | -5.00000 | -14.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 0.00000 |
| 50% | -2.00000 | -6.00000 | 0.00000 | 0.00000 | 3.00000 | 0.00000 | 3.00000 |
| 75% | 7.00000 | 8.00000 | 17.00000 | 0.00000 | 20.00000 | 0.00000 | 31.00000 |
| max | 1603.00000 | 1594.00000 | 1594.00000 | 976.00000 | 1205.00000 | 398.00000 | 1379.00000 |
#Here we take the date of the flight and convert it to a month so that we can investigate seasonal trends, and create our
#outcome variable "ON_TIME_ARRIVAL." This is what we will be predicting
dfReduced['MONTH'] = dfReduced['FL_DATE'].str[5:7]
conditions = [
(dfReduced['CRS_ARR_TIME'] >= dfReduced['ARR_TIME']),
(dfReduced['CRS_ARR_TIME'] < dfReduced['ARR_TIME'])]
choices = [1, 0]
dfReduced['ON_TIME_ARRIVAL'] = np.select(conditions, choices, default='null')
We drop rows that are NA for arrival time, expected arrival time, actual elapsed time, and air time; we are only interested in flights that arrived at their destination.
Arrival time is needed to calculate ON_TIME_ARRIVAL in order to train the model. ACTUAL_ELAPSED_TIME and AIR_TIME relate to how long it took the flight to arrive at its destination, and only flights that actually arrived will have a value, so we drop the NAs.
For our delay variables, we set the NAs to 0.0, meaning there was no delay. The previous value was NaN, which logistic regression cannot handle.
# We are only interested in flights that actually arrived.
# NAs are very few for these columns. We will drop them.
dfReduced = dfReduced.dropna(subset=['ARR_TIME', 'CRS_ELAPSED_TIME', 'ACTUAL_ELAPSED_TIME', 'AIR_TIME'])
# Set NaN values to 0.0 to show zero minutes of delay. Use an array and a for loop to avoid repetitive code
def replaceNaN(data, arr):
    for delayType in arr:
        data[delayType] = data[delayType].fillna(0.0)
    return data
dfReduced = replaceNaN(dfReduced, delayArr)
#Now that we have MONTH, let's get rid of FL_DATE since its information is captured
del dfReduced['FL_DATE']
#Lets reduce to top 10 airports by passengers for 2018
#https://en.wikipedia.org/wiki/List_of_the_busiest_airports_in_the_United_States#Busiest_U.S._airports_by_total_passenger_traffic_(2018)
dfReduced = dfReduced[dfReduced.ORIGIN.isin(["ATL", "LAX", "ORD", "DFW", "DEN", "JFK", "SFO", "SEA", "LAS", "MCO"])]
#print(dfReduced)
dfReduced[['OP_CARRIER','ORIGIN','MONTH','ON_TIME_ARRIVAL']].describe().transpose()
| count | unique | top | freq | |
|---|---|---|---|---|
| OP_CARRIER | 45284 | 18 | DL | 8443 |
| ORIGIN | 45284 | 10 | ATL | 8049 |
| MONTH | 45284 | 12 | 07 | 4123 |
| ON_TIME_ARRIVAL | 45284 | 2 | 1 | 29072 |
dfReduced.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 45284 entries, 929508 to 6830663 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 OP_CARRIER 45284 non-null object 1 ORIGIN 45284 non-null object 2 CRS_DEP_TIME 45284 non-null int64 3 DEP_TIME 45284 non-null float64 4 DEP_DELAY 45284 non-null float64 5 WHEELS_OFF 45284 non-null float64 6 WHEELS_ON 45284 non-null float64 7 CRS_ARR_TIME 45284 non-null int64 8 ARR_TIME 45284 non-null float64 9 ARR_DELAY 45284 non-null float64 10 CRS_ELAPSED_TIME 45284 non-null float64 11 ACTUAL_ELAPSED_TIME 45284 non-null float64 12 AIR_TIME 45284 non-null float64 13 CARRIER_DELAY 45284 non-null float64 14 WEATHER_DELAY 45284 non-null float64 15 NAS_DELAY 45284 non-null float64 16 SECURITY_DELAY 45284 non-null float64 17 LATE_AIRCRAFT_DELAY 45284 non-null float64 18 MONTH 45284 non-null object 19 ON_TIME_ARRIVAL 45284 non-null object dtypes: float64(14), int64(2), object(4) memory usage: 7.3+ MB
# perform one-hot encoding of the categorical data "OP_CARRIER"
tmp_df = pd.get_dummies(dfReduced.OP_CARRIER,prefix="Operating Carrier")
dfReduced = pd.concat((dfReduced,tmp_df),axis=1)
# perform one-hot encoding of the categorical data "ORIGIN"
tmp_df = pd.get_dummies(dfReduced.ORIGIN,prefix="Origin")
dfReduced = pd.concat((dfReduced,tmp_df),axis=1)
# perform one-hot encoding of the categorical data "Month"
tmp_df = pd.get_dummies(dfReduced.MONTH,prefix="Month")
dfReduced = pd.concat((dfReduced,tmp_df),axis=1)
#Drop the variables we one-hot encoded
del dfReduced['OP_CARRIER']
del dfReduced['ORIGIN']
del dfReduced['MONTH']
#Drop ARR_DELAY, because it is essentially a proxy for 'ON_TIME_ARRIVAL'
del dfReduced['ARR_DELAY']
dfReduced.info()
#Create this df for use in SVM
SVMDat=dfReduced.copy()
#SVMDat.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 45284 entries, 929508 to 6830663 Data columns (total 56 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CRS_DEP_TIME 45284 non-null int64 1 DEP_TIME 45284 non-null float64 2 DEP_DELAY 45284 non-null float64 3 WHEELS_OFF 45284 non-null float64 4 WHEELS_ON 45284 non-null float64 5 CRS_ARR_TIME 45284 non-null int64 6 ARR_TIME 45284 non-null float64 7 CRS_ELAPSED_TIME 45284 non-null float64 8 ACTUAL_ELAPSED_TIME 45284 non-null float64 9 AIR_TIME 45284 non-null float64 10 CARRIER_DELAY 45284 non-null float64 11 WEATHER_DELAY 45284 non-null float64 12 NAS_DELAY 45284 non-null float64 13 SECURITY_DELAY 45284 non-null float64 14 LATE_AIRCRAFT_DELAY 45284 non-null float64 15 ON_TIME_ARRIVAL 45284 non-null object 16 Operating Carrier_9E 45284 non-null uint8 17 Operating Carrier_AA 45284 non-null uint8 18 Operating Carrier_AS 45284 non-null uint8 19 Operating Carrier_B6 45284 non-null uint8 20 Operating Carrier_DL 45284 non-null uint8 21 Operating Carrier_EV 45284 non-null uint8 22 Operating Carrier_F9 45284 non-null uint8 23 Operating Carrier_G4 45284 non-null uint8 24 Operating Carrier_HA 45284 non-null uint8 25 Operating Carrier_MQ 45284 non-null uint8 26 Operating Carrier_NK 45284 non-null uint8 27 Operating Carrier_OH 45284 non-null uint8 28 Operating Carrier_OO 45284 non-null uint8 29 Operating Carrier_UA 45284 non-null uint8 30 Operating Carrier_VX 45284 non-null uint8 31 Operating Carrier_WN 45284 non-null uint8 32 Operating Carrier_YV 45284 non-null uint8 33 Operating Carrier_YX 45284 non-null uint8 34 Origin_ATL 45284 non-null uint8 35 Origin_DEN 45284 non-null uint8 36 Origin_DFW 45284 non-null uint8 37 Origin_JFK 45284 non-null uint8 38 Origin_LAS 45284 non-null uint8 39 Origin_LAX 45284 non-null uint8 40 Origin_MCO 45284 non-null uint8 41 Origin_ORD 45284 non-null uint8 42 Origin_SEA 45284 non-null uint8 43 Origin_SFO 45284 non-null uint8 44 Month_01 45284 non-null uint8 45 Month_02 45284 
non-null uint8 46 Month_03 45284 non-null uint8 47 Month_04 45284 non-null uint8 48 Month_05 45284 non-null uint8 49 Month_06 45284 non-null uint8 50 Month_07 45284 non-null uint8 51 Month_08 45284 non-null uint8 52 Month_09 45284 non-null uint8 53 Month_10 45284 non-null uint8 54 Month_11 45284 non-null uint8 55 Month_12 45284 non-null uint8 dtypes: float64(13), int64(2), object(1), uint8(40) memory usage: 7.6+ MB
#Investigate suitability of the data for logistic regression
#This data is quite imbalanced and could yield reasonably high classification accuracy by just assuming all flights arrive on time.
#This will need to be fixed. We correct it a few sections down, but it's important to visualize now.
import seaborn as sns
dfReduced['ON_TIME_ARRIVAL'].value_counts()
sns.countplot(x='ON_TIME_ARRIVAL',data=dfReduced,palette='hls')
<AxesSubplot:xlabel='ON_TIME_ARRIVAL', ylabel='count'>
#This section scales the variables to values between 0 and 1 in order to get the data ready for logistic regression.
#We only scale variables that aren't already between 0 and 1
from sklearn.preprocessing import MinMaxScaler
# Scale only columns that have values greater than 1
mms = MinMaxScaler()
dfScaled=dfReduced
dfScaled[["CRS_DEP_TIME", "DEP_DELAY","WHEELS_OFF","WHEELS_ON","CRS_ARR_TIME","ARR_TIME",
"CRS_ELAPSED_TIME","ACTUAL_ELAPSED_TIME","AIR_TIME","CARRIER_DELAY","WEATHER_DELAY","NAS_DELAY",
"SECURITY_DELAY","LATE_AIRCRAFT_DELAY"]] = mms.fit_transform(dfScaled[["CRS_DEP_TIME", "DEP_DELAY","WHEELS_OFF","WHEELS_ON","CRS_ARR_TIME","ARR_TIME",
"CRS_ELAPSED_TIME","ACTUAL_ELAPSED_TIME","AIR_TIME","CARRIER_DELAY","WEATHER_DELAY","NAS_DELAY",
"SECURITY_DELAY","LATE_AIRCRAFT_DELAY"]])
print(dfScaled)
CRS_DEP_TIME DEP_TIME DEP_DELAY WHEELS_OFF WHEELS_ON \
929508 0.701442 1925.0 0.149042 0.810338 0.922885
2390254 0.611111 1441.0 0.023314 0.606920 0.643602
1797438 0.347328 944.0 0.094088 0.419758 0.729887
4821960 0.360051 850.0 0.024147 0.375990 0.416424
4330544 0.735793 1818.0 0.059117 0.765736 0.797832
... ... ... ... ... ...
2987493 0.765055 1804.0 0.023314 0.761150 0.844519
4443275 0.784139 1851.0 0.024979 0.794081 0.807003
5364170 0.215013 502.0 0.019151 0.214256 0.309296
5364381 0.871925 2109.0 0.034138 0.885786 0.020425
6830663 0.890161 2052.0 0.017485 0.876198 0.185911
CRS_ARR_TIME ARR_TIME CRS_ELAPSED_TIME ACTUAL_ELAPSED_TIME \
929508 0.802418 0.927887 0.178248 0.232628
2390254 0.647770 0.646103 0.060423 0.063444
1797438 0.686119 0.751563 0.444109 0.441088
4821960 0.422676 0.417674 0.078550 0.067976
4330544 0.766986 0.799083 0.048338 0.048338
... ... ... ... ...
2987493 0.846603 0.847436 0.353474 0.365559
4443275 0.814506 0.810754 0.320242 0.312689
5364170 0.335140 0.313464 0.126888 0.125378
5364381 0.022509 0.022093 0.219033 0.206949
6830663 0.213422 0.187995 0.422961 0.410876
AIR_TIME ... Month_03 Month_04 Month_05 Month_06 Month_07 \
929508 0.208909 ... 0 0 0 0 0
2390254 0.052227 ... 0 0 1 0 0
1797438 0.414747 ... 0 1 0 0 0
4821960 0.066052 ... 0 0 0 0 0
4330544 0.035330 ... 0 0 0 0 0
... ... ... ... ... ... ... ...
2987493 0.347158 ... 0 0 0 1 0
4443275 0.302611 ... 0 0 0 0 0
5364170 0.113671 ... 0 0 0 0 0
5364381 0.199693 ... 0 0 0 0 0
6830663 0.414747 ... 0 0 0 0 0
Month_08 Month_09 Month_10 Month_11 Month_12
929508 0 0 0 0 0
2390254 0 0 0 0 0
1797438 0 0 0 0 0
4821960 1 0 0 0 0
4330544 1 0 0 0 0
... ... ... ... ... ...
2987493 0 0 0 0 0
4443275 1 0 0 0 0
5364170 0 1 0 0 0
5364381 0 1 0 0 0
6830663 0 0 0 0 1
[45284 rows x 56 columns]
We are reusing code from the data mining notebooks at https://github.com/jakemdrew/DataMiningNotebooks to perform the train/test split, logistic regression, support vector machine, and feature importance interpretation.
#This is the section where we begin to split up the data
#This is where we create the 80% train / 20% test split
from sklearn.model_selection import train_test_split
#Drop outcome variable from x and keep in y
X = dfScaled.drop('ON_TIME_ARRIVAL', axis=1)
y = dfScaled['ON_TIME_ARRIVAL']
XSVM = dfScaled.drop('ON_TIME_ARRIVAL', axis=1)
ySVM = dfScaled['ON_TIME_ARRIVAL']
#Direct train/test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, random_state=42
)
#Examine positive class in train/test split to examine balance between delayed flights and on time.
#We are currently about 2/3 on time, and 1/3 delayed. Meaning we could do pretty well by guessing a flight is on time
#This is an issue since we want to know if a flight will be delayed. Fix is in next section
print(f'''% Positive class in Train = {np.round(y_train.value_counts(normalize=True)[1] * 100, 2)}
% Positive class in Test = {np.round(y_test.value_counts(normalize=True)[1] * 100, 2)}''')
% Positive class in Train = 35.7 % Positive class in Test = 36.2
from imblearn.over_sampling import SMOTE
#Use SMOTE to balance the outcome variable
sm = SMOTE(random_state=25)
X_sm, y_sm = sm.fit_resample(X, y)
#We now have a 50/50 split of on time and delayed flights
print('\nBalance of positive and negative classes (%):')
y_sm.value_counts(normalize=True) * 100
Balance of positive and negative classes (%):
0 50.0 1 50.0 Name: ON_TIME_ARRIVAL, dtype: float64
#Testing the outcome variable
#print(y_sm)
from sklearn.model_selection import ShuffleSplit
#We want to predict y from X using the balanced data
y = y_sm.values # get the labels we want
X = X_sm.values # use everything else to predict!
#Use a CV object
#Do 5 iterations
num_cv_iterations = 5
num_instances = len(y)
#Note the fixed random state for reproducibility
cv_object = ShuffleSplit(n_splits=num_cv_iterations,random_state=3,test_size = 0.2)
#Print the CV object to ensure everything is correct
print(cv_object)
ShuffleSplit(n_splits=5, random_state=3, test_size=0.2, train_size=None)
To split the data into training and test sets, we used an 80% training / 20% testing split. As specified in our first project, we use 5-fold cross-validation, and we judge model performance on accuracy. We are aiming for at least 80% accuracy in order to consider an algorithm successful. As a reminder, here is the formula we use for accuracy in our confusion matrices:
$accuracy = \frac{\text{True Positive} + \text{True Negative}}{\text{True Positive} + \text{True Negative} + \text{False Positive} + \text{False Negative}}$
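To make the metric concrete, here is a minimal sketch of computing accuracy from a 2x2 confusion matrix in scikit-learn's row-is-truth layout. The matrix values below are illustrative, not taken from one of our modeling runs.

```python
import numpy as np

# Illustrative confusion matrix (rows = actual, cols = predicted), in the
# same layout sklearn.metrics.confusion_matrix produces: [[TN, FP], [FN, TP]]
conf = np.array([[4600, 1200],
                 [  30, 5800]])

tn, fp, fn, tp = conf.ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(round(accuracy, 4))  # fraction of all predictions that were correct
```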
%%time
# run logistic regression and vary some parameters
from sklearn.linear_model import LogisticRegression
from sklearn import metrics as mt
#Create LR object
lr_clf = LogisticRegression(penalty='l2', C=1.0, class_weight=None, solver='liblinear' ) # get object
#Iterate through different train/test splits
iter_num=0
for train_indices, test_indices in cv_object.split(X,y):
    X_train = X[train_indices]
    y_train = y[train_indices]
    X_test = X[test_indices]
    y_test = y[test_indices]
    # train the LR model on the training data
    lr_clf.fit(X_train,y_train)
    y_hat = lr_clf.predict(X_test)
    # now let's get the accuracy and confusion matrix for this iteration of training/testing
    acc = mt.accuracy_score(y_test,y_hat)
    conf = mt.confusion_matrix(y_test,y_hat)
    print("====Iteration",iter_num," ====")
    print("accuracy", acc)
    print("confusion matrix\n",conf)
    iter_num+=1
====Iteration 0 ==== accuracy 0.8930260555507782 confusion matrix [[4596 1216] [ 28 5789]] ====Iteration 1 ==== accuracy 0.8976696190558088 confusion matrix [[4627 1168] [ 22 5812]] ====Iteration 2 ==== accuracy 0.8953478373032935 confusion matrix [[4686 1203] [ 14 5726]] ====Iteration 3 ==== accuracy 0.9005933442256427 confusion matrix [[4693 1140] [ 16 5780]] ====Iteration 4 ==== accuracy 0.899475449307765 confusion matrix [[4682 1148] [ 21 5778]] CPU times: user 5.6 s, sys: 550 ms, total: 6.15 s Wall time: 3.5 s
#import sys
#!{sys.executable} -m pip install sklearn.model_selection
# and here is an even shorter way of getting the accuracies for each training and test set
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
accuracies = cross_val_score(lr_clf, X, y=y, cv=cv_object)
print(accuracies)
[0.89302606 0.89766962 0.89534784 0.90059334 0.89947545]
%%time
# here we can change some of the parameters interactively
from ipywidgets import widgets as wd
def lr_explor(cost):
    lr_clf = LogisticRegression(penalty='l2', C=cost, class_weight=None, solver='liblinear') # get object
    accuracies = cross_val_score(lr_clf, X, y=y, cv=cv_object) # this also can help with parallelism
    print(accuracies)
wd.interact(lr_explor, cost=(0.001,5.0,0.05), __manual=True)
CPU times: user 6.47 s, sys: 510 ms, total: 6.98 s Wall time: 3.99 s
<function __main__.lr_explor(cost)>
from sklearn.model_selection import ShuffleSplit
# we want to predict the X and y data as follows:
#Have to reset data, we used scaled data for LR.
if 'ON_TIME_ARRIVAL' in SVMDat:
    ySVM = SVMDat['ON_TIME_ARRIVAL'].values
    del SVMDat['ON_TIME_ARRIVAL']
XSVM = SVMDat.values
num_cv_iterationsSVM = 5
num_instances = len(ySVM)
#Notice set as random state
cv_objectSVM = ShuffleSplit(n_splits=num_cv_iterationsSVM,
test_size = 0.2,train_size=0.8,random_state=34)
print(cv_objectSVM)
ShuffleSplit(n_splits=5, random_state=34, test_size=0.2, train_size=0.8)
from sklearn import svm
from sklearn.preprocessing import StandardScaler # needed for the in-fold scaling below
# train object
#Iterate through different train/test splits
svm_clf = svm.SVC(C=0.5, kernel='rbf', degree=3, gamma='auto') # get object
iter_num=0
for train_indicesSVM, test_indicesSVM in cv_objectSVM.split(XSVM,ySVM):
    X_trainSVM = XSVM[train_indicesSVM]
    y_trainSVM = ySVM[train_indicesSVM]
    X_testSVM = XSVM[test_indicesSVM]
    y_testSVM = ySVM[test_indicesSVM]
    # scale attributes by the training set
    scl_obj = StandardScaler()
    scl_obj.fit(X_trainSVM) # find scalings for each column that make it zero mean and unit std
    # the fit above looks only at the training data for the mean and std, so we can use it
    # to transform new feature data
    X_train_scaled = scl_obj.transform(X_trainSVM) # apply to training
    X_test_scaled = scl_obj.transform(X_testSVM)   # apply to test
    svm_clf.fit(X_train_scaled, y_trainSVM)
    y_hatSVM = svm_clf.predict(X_test_scaled) # get test set predictions
    accSVM = mt.accuracy_score(y_testSVM,y_hatSVM)
    confSVM = mt.confusion_matrix(y_testSVM,y_hatSVM)
    print("====Iteration",iter_num," ====")
    print("accuracy", accSVM)
    print("confusion matrix\n",confSVM)
    iter_num+=1
====Iteration 0 ==== accuracy 0.8527106105774539 confusion matrix [[1928 1303] [ 31 5795]] ====Iteration 1 ==== accuracy 0.8583416142210445 confusion matrix [[1954 1250] [ 33 5820]] ====Iteration 2 ==== accuracy 0.8614331456332118 confusion matrix [[1988 1229] [ 26 5814]] ====Iteration 3 ==== accuracy 0.8602186154355747 confusion matrix [[1908 1251] [ 15 5883]] ====Iteration 4 ==== accuracy 0.861985204813956 confusion matrix [[1957 1223] [ 27 5850]]
#Just like in LR, we condense the results so they can be used below
#from sklearn.model_selection import cross_val_score
#from sklearn.model_selection import cross_val_score
#from sklearn.model_selection import train_test_split
#accuracies = cross_val_score(svm_clf, XSVM, y=ySVM, cv=cv_objectSVM)
#print(accuracies)
%%time
#This block allows us to tune parameters (cost, gamma, and kernel)
#This can run very slowly!
def SVM_explor(cost, gamma, kernel):
    svm_clf = svm.SVC(C=cost, kernel=kernel, degree=3, gamma=gamma) # get object
    accuracies = cross_val_score(svm_clf, XSVM, y=ySVM, cv=cv_objectSVM) # cross_val_score refits per fold, so no separate fit/predict is needed
    print(accuracies)
wd.interact(SVM_explor, cost=(0.001,5.0,0.05), gamma=(0.1,1,0.05), kernel=['linear','rbf','poly'], __manual=True)
CPU times: user 1min 32s, sys: 545 ms, total: 1min 33s Wall time: 1min 33s
<function __main__.SVM_explor(cost, gamma, kernel)>
| Model Type | Cost | Gamma | Accuracy | Time | Kernel |
|---|---|---|---|---|---|
| Logistic | 1.0 | N/A | 0.8932 | 6.82s (5 iterations) | N/A |
| Logistic | 2.5 | N/A | 0.9237 | 6.85s (5 iterations) | N/A |
| Logistic | 4.95 | N/A | 0.9222 | 6.64s (5 iterations) | N/A |
| SVM | 1.0 | 0.10 | 0.998 | 1m (5 iterations) | linear |
| SVM | 2.5 | 0.5 | 0.999 | 1.2m (5 iterations) | linear |
| SVM | 5.0 | 1.0 | 0.997 | 1.1m (5 iterations) | linear |
| SVM | 1.0 | auto | 0.85 | 2m 35s (5 iterations) | RBF |
| SVM | 2.5 | 0.5 | 0.97 | 5m 25s (5 iterations) | Poly |
The results from each modeling run are reasonable; however, there are a few key differences in performance in terms of prediction accuracy, training time, and efficiency. General conclusion: the logistic regression model can iterate through 5 folds in a matter of seconds and achieves between 89% and 92% accuracy. Logistic regression offers superior performance in terms of prediction accuracy. (We say this despite the SVM reaching 99% accuracy, as we believe the SVM with a linear kernel overfits the data.) On a dataset this large, there is a fair trade-off: LR's prediction accuracy is sometimes lower, in exchange for much faster execution and somewhat easier interpretation.
The logistic regression required a bit more pre-processing of the data. We had to scale the data, one-hot encode, and resample using SMOTE to get reliable results from the LR model. However, since we put the effort into pre-processing, the modeling aspect of the LR model was straightforward. The assumptions were met, the results seemed reasonable, and they met our success metric of classification accuracy > 0.8. We can see the classifier performs well because it properly buckets on-time and not-on-time flights rather than simply classifying everything as on time (which could still yield roughly 64% accuracy given the class split); resampling fixed this potential issue. The execution time of the logistic regression model is also orders of magnitude shorter than the competition, making it the preferred model on this merit alone. With model tuning (the cost parameter) we can also push classification accuracy into the low 90s.
The SVM model was more difficult, mainly because of the extremely long computation times required to test each iteration of the algorithm. However, this model can be more finely tuned (cost, gamma, and kernel). The linear kernel ran the quickest but produced somewhat unbelievable results (prediction accuracy as high as 1), and the poly kernel also produced unbelievable results. After some work, the rbf kernel produced usable output: despite a long runtime, it gave believable results around 85% accuracy. These are lower than what the LR model produced, with a much longer compute time. Still, it was good to see another model validate the data.
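To illustrate the accuracy-versus-runtime trade-off discussed above, here is a hedged sketch that times one fit-and-score pass for logistic regression and an rbf SVM. It uses a synthetic dataset from `make_classification` rather than our flight data, so the sizes, parameters, and resulting numbers are illustrative only.

```python
import time
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the DOT flight dataset)
Xd, yd = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(Xd, yd, test_size=0.2, random_state=0)

for name, clf in [("LR", LogisticRegression(max_iter=1000)),
                  ("SVM (rbf)", SVC(C=0.5, kernel="rbf", gamma="auto"))]:
    start = time.perf_counter()
    clf.fit(X_tr, y_tr)              # train
    acc = clf.score(X_te, y_te)      # held-out accuracy
    elapsed = time.perf_counter() - start
    print(f"{name}: accuracy={acc:.3f}, fit+score={elapsed:.2f}s")
```

On most machines the SVM's fit time grows much faster with sample size than LR's, which mirrors the gap we saw between seconds-long LR folds and minutes-long SVM folds.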
from sklearn.preprocessing import StandardScaler
#This section normalizes the features
# scale attributes by the training set
scl_obj = StandardScaler()
scl_obj.fit(X_train) # find scalings for each column that make this zero mean and unit std
# the line of code above only looks at training data to get mean and std and we can use it
# to transform new feature data
# apply to training
X_train_scaled = scl_obj.transform(X_train)
# apply those means and std to the test set
X_test_scaled = scl_obj.transform(X_test)
# Train model
lr_clf = LogisticRegression(penalty='l2', C=0.05, solver='liblinear')
lr_clf.fit(X_train_scaled,y_train)
#Make predictions
y_hat = lr_clf.predict(X_test_scaled)
#Get Accuracy and generate CM
acc = mt.accuracy_score(y_test,y_hat)
conf = mt.confusion_matrix(y_test,y_hat)
print('accuracy:', acc )
print(conf )
# sort these attributes
feature_names = dfScaled.drop(columns=['ON_TIME_ARRIVAL']).columns # align names with the model inputs (X excludes the target)
zip_vars = zip(lr_clf.coef_.T, feature_names) # combine weights with attribute names
zip_vars = sorted(zip_vars)
for coef, name in zip_vars:
print(name, 'has weight of', coef[0]) # now print them out
accuracy: 0.9599277667899218
confusion matrix:
[[5399  431]
 [  35 5764]]
Feature weights, sorted ascending:
ACTUAL_ELAPSED_TIME has weight of -9.201018584642721
ARR_TIME has weight of -5.859526598296983
DEP_DELAY has weight of -5.613312534100306
WHEELS_ON has weight of -0.297451863559278
DEP_TIME has weight of -0.2509415987824683
WHEELS_OFF has weight of -0.14824359007635143
SECURITY_DELAY has weight of -0.06592359787363292
Operating Carrier_NK has weight of 0.05197089887280508
Operating Carrier_G4 has weight of 0.08076043734596124
Operating Carrier_F9 has weight of 0.1062326254186496
Operating Carrier_UA has weight of 0.11835645654082892
Operating Carrier_YV has weight of 0.22602022325796675
Operating Carrier_WN has weight of 0.2381519215931969
Operating Carrier_DL has weight of 0.24536445279624775
Operating Carrier_EV has weight of 0.2457658511473291
ON_TIME_ARRIVAL has weight of 0.30334991660597604
CRS_DEP_TIME has weight of 0.3238918415257512
Operating Carrier_MQ has weight of 0.35970303783534324
Operating Carrier_AS has weight of 0.37767848065963633
AIR_TIME has weight of 0.42039429258168487
Operating Carrier_AA has weight of 0.4338909738230082
Operating Carrier_HA has weight of 0.4400435305824895
Origin_DFW has weight of 0.4627684818615293
Origin_ORD has weight of 0.47208286386074716
Origin_LAX has weight of 0.47817863415697354
Origin_JFK has weight of 0.4859072735967373
Origin_SEA has weight of 0.5261872991367943
NAS_DELAY has weight of 0.565663569017054
Origin_LAS has weight of 0.5757561712125924
Origin_ATL has weight of 0.6077882413653098
Origin_DEN has weight of 0.6165573175345616
Operating Carrier_OH has weight of 0.634804403440824
Origin_MCO has weight of 0.6702265576256436
Operating Carrier_VX has weight of 0.679585484668595
Operating Carrier_OO has weight of 0.6967965324789217
Operating Carrier_9E has weight of 0.7175913826617408
Operating Carrier_YX has weight of 0.733556239498438
Month_01 has weight of 0.7517667666268192
Month_11 has weight of 0.7746744268490674
Month_10 has weight of 0.7780517783413488
Origin_SFO has weight of 0.7835362388073314
Month_05 has weight of 0.7914055506763922
Month_03 has weight of 0.7917193482105465
Month_09 has weight of 0.7948698148365005
Month_02 has weight of 0.8081357421207184
Month_07 has weight of 0.8105724452944064
Month_06 has weight of 0.8138545529379896
Month_08 has weight of 0.8144560691175288
Month_04 has weight of 0.835141369002159
Operating Carrier_B6 has weight of 0.8397881787567412
WEATHER_DELAY has weight of 1.285363525349125
LATE_AIRCRAFT_DELAY has weight of 1.6477285462671103
CARRIER_DELAY has weight of 2.3628109253295975
CRS_ARR_TIME has weight of 5.950919728126783
CRS_ELAPSED_TIME has weight of 8.677556065877273
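A listing like the one above can be produced by pairing each column name with its fitted coefficient and sorting by weight. Below is a minimal sketch on made-up data; the names and arrays are illustrative stand-ins, not the flight dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# illustrative stand-ins for the real feature matrix and labels
rng = np.random.default_rng(0)
feature_names = ['DEP_DELAY', 'AIR_TIME', 'CARRIER_DELAY']
X_demo = rng.normal(size=(200, 3))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 2] > 0).astype(int)

clf = LogisticRegression().fit(X_demo, y_demo)

# pair names with coefficients and report in ascending order of weight
for name, w in sorted(zip(feature_names, clf.coef_[0]), key=lambda t: t[1]):
    print(f'{name} has weight of {w}')
```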
ACTUAL_ELAPSED_TIME: As this is a numeric variable, the interpretation is that, all else being equal, longer flights are less likely to be delayed (late arrival). This is plausible because although flights sometimes leave late, they can make up for lost time en route, and this effect is confounded with flight length.
ARR_TIME: As this is a numeric variable and a time stamp, the interpretation does not make sense and this variable should be removed from the dataset.
DEP_DELAY: As this is a numeric variable, the interpretation is that, all else being equal, departure-delayed flights are less likely to be delayed on arrival, since flights that leave late can make up for lost time en route. This is somewhat counter-intuitive, but a departure counts as delayed as soon as one minute passes after the scheduled departure time; the incentive to make up time could be great here.
WHEELS_ON: Wheels-on is the moment the wheels touch down at the arrival airport. As this is a numeric variable and a time stamp, the interpretation does not make sense and this variable should be removed from the dataset.
DEP_TIME: As this is a numeric variable and a time stamp, the interpretation does not make sense and this variable should be removed from the dataset.
WHEELS_OFF: Wheels-off is the moment the wheels leave the ground at the departure airport. As this is a numeric variable and a time stamp, the interpretation does not make sense and this variable should be removed from the dataset.
SECURITY_DELAY: Security delays are the least impactful form of delay; they are often short, and of the delay types this is the one most associated with a flight still arriving on time.
Operating Carrier: All things being equal, the operating carrier on its own seems to have very little to do with delays. All operating carriers have positive weights, meaning they are associated with delays, with Spirit Airlines (NK) the least impactful offender and JetBlue (B6) carrying the most weight when it comes to a delayed arrival. One interpretation is that, all things held equal, a delayed JetBlue flight is, of all the carriers, the most likely to also arrive late.
CRS_DEP_TIME: As this is a numeric variable and a time stamp, the interpretation does not make sense and this variable should be removed from the dataset.
AIR_TIME: As this is a numeric variable, the interpretation is that, all else being equal, the more air time, the more likely a flight is to arrive on time. As previously discussed, more air time is more of an opportunity to make up for the various forms of departure delay.
Origin: These are one-hot encoded categorical variables. All else being equal, flights originating out of MCO (Orlando) are noticeably more likely to have an arrival delay than flights departing from DFW. This could be for many reasons: MCO is on average further from its destinations than DFW, and events causing departure delays at MCO could span longer than delay events originating at DFW. Most of the other origins fall somewhere in between.
Origin_SFO: SFO is the worst origin for delays, for a variety of reasons: weather, geography, and airport congestion.
NAS_DELAY: This is a delay attributed to the National Aviation System (e.g., air traffic control, airport operations, and non-extreme weather). These types of delays are not very impactful on arrival times.
Month: These are one-hot encoded categorical variables. All else being equal, flights departing in April are the most likely of any month to have an arrival delay, while flights departing in January are the least likely (November is another good month for avoiding delays). This could be for a variety of reasons: weather and the seasonality of travel.
This was actually a very good finding; to check our results we googled "best and worst months to travel", and our results matched:
https://www.refund.me/best-months-to-travel/
WEATHER_DELAY: This is a delay caused by weather. It is out of everyone's control, and it is beginning to have an impact on arrival times.
LATE_AIRCRAFT_DELAY: This is a delay caused when the inbound aircraft is late for some reason (potentially one of the other delay forms), which causes ripple effects through the system.
CARRIER_DELAY: This is an airline-induced delay: maintenance, crew scheduling, late connecting passengers, and other things on the operations side of an airline. This is the type of delay most commonly associated with heavy arrival delays.
CRS_ARR_TIME: As this is a numeric variable and a time stamp, the interpretation does not make sense and this variable should be removed from the dataset.
CRS_ELAPSED_TIME: As this is a numeric variable and a measure of a flight's length, the interpretation does not make sense and this variable should be removed from the dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# we want to normalize the features based on the mean and standard deviation of each column.
# However, we must not use the testing data to find the mean and std (that would be snooping).
# You can apply the StandardScaler inside of the cross-validation loop,
# but this requires the use of Pipelines in scikit-learn.
# A pipeline can apply feature pre-processing and model fitting in one compact notation.
# Here is an example!
std_scl = StandardScaler()
lr_clf = LogisticRegression(penalty='l2', C=0.05, solver='liblinear')
# create the pipeline
piped_object = Pipeline([('scale', std_scl),        # do this
                         ('logit_model', lr_clf)])  # and then do this
weights = []
# run the pipeline cross-validated; cv_object, X, and y were defined earlier in the notebook
for iter_num, (train_indices, test_indices) in enumerate(cv_object.split(X, y)):
    piped_object.fit(X[train_indices], y[train_indices])  # train on this fold only
    # it is a little odd getting trained objects out of a pipeline:
    weights.append(piped_object.named_steps['logit_model'].coef_[0])
weights = np.array(weights)
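As a more compact alternative to the manual loop, `cross_val_score` will run the same pipeline fold by fold, re-fitting the scaler on each training split so no test data leaks into the normalization. A minimal sketch on synthetic stand-in data (the notebook's real `X`, `y`, and CV object would be substituted):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in for the notebook's X and y
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# same two-step pipeline: scale, then fit the logistic model
pipe = Pipeline([('scale', StandardScaler()),
                 ('logit_model', LogisticRegression(penalty='l2', C=0.05,
                                                    solver='liblinear'))])

# the scaler is re-fit on each fold's training split only -- no snooping
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv)
print(scores.mean())
```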
#import sys
#!{sys.executable} -m pip install plotly
import plotly
plotly.offline.init_notebook_mode() # run at the start of every notebook
error_y=dict(
type='data',
array=np.std(weights,axis=0),
visible=True
)
graph1 = {'x': dfScaled.columns,
'y': np.mean(weights,axis=0),
'error_y':error_y,
'type': 'bar'}
fig = dict()
fig['data'] = [graph1]
fig['layout'] = {'title': 'Logistic Regression Weights, with error bars'}
plotly.offline.iplot(fig)
# look at the support vectors
print(svm_clf.support_vectors_.shape)
print(svm_clf.support_.shape)
print(svm_clf.n_support_ )
(15296, 55) (15296,) [7412 7884]
#Examine instances chosen as support vectors
# make a dataframe of the training data
df_tested_on = SVMDat.iloc[train_indicesSVM].copy() # saved from above, the indices chosen for training
print(SVMDat.iloc[train_indicesSVM])
# now get the support vectors from the trained model
df_support = df_tested_on.iloc[svm_clf.support_,:].copy()
df_support.info()
#print(train_indicesSVM.shape)
df_support['ON_TIME_ARRIVAL'] = ySVM[svm_clf.support_]
SVMDat['ON_TIME_ARRIVAL'] = ySVM
#df_support.info()
CRS_DEP_TIME DEP_TIME DEP_DELAY WHEELS_OFF WHEELS_ON \
6053548 1745 1740.0 -5.0 1753.0 1905.0
1595084 1829 1904.0 35.0 1915.0 2207.0
1977183 1640 1638.0 -2.0 1651.0 1907.0
4891765 2127 2323.0 116.0 2337.0 105.0
4754142 1459 1458.0 -1.0 1518.0 1724.0
... ... ... ... ... ...
7032120 1839 1907.0 28.0 1936.0 2120.0
3776161 1339 1337.0 -2.0 1406.0 1451.0
1261819 1928 1929.0 1.0 1943.0 2016.0
2132494 915 911.0 -4.0 922.0 1036.0
3962973 915 909.0 -6.0 925.0 1403.0
CRS_ARR_TIME ARR_TIME CRS_ELAPSED_TIME ACTUAL_ELAPSED_TIME \
6053548 1912 1909.0 87.0 89.0
1595084 2147 2216.0 138.0 132.0
1977183 1917 1920.0 157.0 162.0
4891765 2330 112.0 303.0 289.0
4754142 1732 1726.0 93.0 88.0
... ... ... ... ...
7032120 2140 2127.0 301.0 260.0
3776161 1454 1456.0 75.0 79.0
1261819 2024 2021.0 56.0 52.0
2132494 1056 1040.0 101.0 89.0
3962973 1412 1408.0 177.0 179.0
AIR_TIME ... Month_04 Month_05 Month_06 Month_07 Month_08 \
6053548 72.0 ... 0 0 0 0 0
1595084 112.0 ... 0 0 0 0 0
1977183 136.0 ... 1 0 0 0 0
4891765 268.0 ... 0 0 0 0 0
4754142 66.0 ... 0 0 0 0 1
... ... ... ... ... ... ... ...
7032120 224.0 ... 0 0 0 0 0
3776161 45.0 ... 0 0 0 1 0
1261819 33.0 ... 0 0 0 0 0
2132494 74.0 ... 1 0 0 0 0
3962973 158.0 ... 0 0 0 1 0
Month_09 Month_10 Month_11 Month_12 ON_TIME_ARRIVAL
6053548 0 0 1 0 1
1595084 0 0 0 0 0
1977183 0 0 0 0 0
4891765 1 0 0 0 1
4754142 0 0 0 0 1
... ... ... ... ... ...
7032120 0 0 0 1 1
3776161 0 0 0 0 0
1261819 0 0 0 0 1
2132494 0 0 0 0 1
3962973 0 0 0 0 1
[36227 rows x 56 columns]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 15296 entries, 1977183 to 3962973
Data columns (total 56 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CRS_DEP_TIME 15296 non-null int64
1 DEP_TIME 15296 non-null float64
2 DEP_DELAY 15296 non-null float64
3 WHEELS_OFF 15296 non-null float64
4 WHEELS_ON 15296 non-null float64
5 CRS_ARR_TIME 15296 non-null int64
6 ARR_TIME 15296 non-null float64
7 CRS_ELAPSED_TIME 15296 non-null float64
8 ACTUAL_ELAPSED_TIME 15296 non-null float64
9 AIR_TIME 15296 non-null float64
10 CARRIER_DELAY 15296 non-null float64
11 WEATHER_DELAY 15296 non-null float64
12 NAS_DELAY 15296 non-null float64
13 SECURITY_DELAY 15296 non-null float64
14 LATE_AIRCRAFT_DELAY 15296 non-null float64
15 Operating Carrier_9E 15296 non-null uint8
16 Operating Carrier_AA 15296 non-null uint8
17 Operating Carrier_AS 15296 non-null uint8
18 Operating Carrier_B6 15296 non-null uint8
19 Operating Carrier_DL 15296 non-null uint8
20 Operating Carrier_EV 15296 non-null uint8
21 Operating Carrier_F9 15296 non-null uint8
22 Operating Carrier_G4 15296 non-null uint8
23 Operating Carrier_HA 15296 non-null uint8
24 Operating Carrier_MQ 15296 non-null uint8
25 Operating Carrier_NK 15296 non-null uint8
26 Operating Carrier_OH 15296 non-null uint8
27 Operating Carrier_OO 15296 non-null uint8
28 Operating Carrier_UA 15296 non-null uint8
29 Operating Carrier_VX 15296 non-null uint8
30 Operating Carrier_WN 15296 non-null uint8
31 Operating Carrier_YV 15296 non-null uint8
32 Operating Carrier_YX 15296 non-null uint8
33 Origin_ATL 15296 non-null uint8
34 Origin_DEN 15296 non-null uint8
35 Origin_DFW 15296 non-null uint8
36 Origin_JFK 15296 non-null uint8
37 Origin_LAS 15296 non-null uint8
38 Origin_LAX 15296 non-null uint8
39 Origin_MCO 15296 non-null uint8
40 Origin_ORD 15296 non-null uint8
41 Origin_SEA 15296 non-null uint8
42 Origin_SFO 15296 non-null uint8
43 Month_01 15296 non-null uint8
44 Month_02 15296 non-null uint8
45 Month_03 15296 non-null uint8
46 Month_04 15296 non-null uint8
47 Month_05 15296 non-null uint8
48 Month_06 15296 non-null uint8
49 Month_07 15296 non-null uint8
50 Month_08 15296 non-null uint8
51 Month_09 15296 non-null uint8
52 Month_10 15296 non-null uint8
53 Month_11 15296 non-null uint8
54 Month_12 15296 non-null uint8
55 ON_TIME_ARRIVAL 15296 non-null object
dtypes: float64(13), int64(2), object(1), uint8(40)
memory usage: 2.6+ MB
%matplotlib inline
from matplotlib import pyplot as plt
# now let's see the statistics of these attributes
# group the original data and the support vectors
df_grouped_support = df_support.groupby(['ON_TIME_ARRIVAL'])
df_grouped = SVMDat.groupby(['ON_TIME_ARRIVAL'])
# plot KDEs of different variables
vars_to_plot = ['ACTUAL_ELAPSED_TIME','DEP_DELAY','ARR_TIME','CRS_ELAPSED_TIME','Operating Carrier_WN','Origin_SFO','CARRIER_DELAY']
for v in vars_to_plot:
    plt.figure(figsize=(10,4))
    # plot support vector stats
    plt.subplot(1,2,1)
    ax = df_grouped_support[v].plot.kde()
    plt.legend(['ON_TIME_ARRIVAL','LATE'])
    plt.title(v+' (Instances chosen as Support Vectors)')
    # plot original distributions
    plt.subplot(1,2,2)
    ax = df_grouped[v].plot.kde()
    plt.legend(['LATE','ON_TIME_ARRIVAL'])
    plt.title(v+' (Original)')
There are 15296 support vectors, each with 55 features.
n_support_ gives the number of support vectors for each class; 'ON_TIME_ARRIVAL' has two classes, 0='YES' and 1='NO'.
Support vectors are the data points that lie *closest* to the decision surface and are therefore the points that are most difficult to classify.
The features selected were ACTUAL_ELAPSED_TIME, DEP_DELAY, ARR_TIME, CRS_ELAPSED_TIME, Operating Carrier_WN, Origin_SFO, and CARRIER_DELAY.
These features were selected for their uniqueness in the dataset. ACTUAL_ELAPSED_TIME is an interesting one: it is a continuous variable measuring the length of a flight, and as it turns out elapsed time is associated with on-time flights. This is problematic, since long elapsed times usually just reflect long distances, so the support vectors may push longer flights onto the on-time side because they behave in the way described above. The separation is not as great as in the original distribution, but this variable seems to be linearly separable.
DEP_DELAY has less separation than the original distribution, and this makes sense: a flight with a departure delay will most likely have an arrival delay as well. That isn't always the case, but it is a strong indicator. The decision boundary here is a strong one and separates the data well. This variable seems to be (very) linearly separable.
ARR_TIME support vectors have less separation than the original data, though the distribution is roughly the same. This variable was chosen at random to help our understanding of support vectors and to see whether it generalizes well.
CRS_ELAPSED_TIME has a bit less separation than the original data, but still seems to be linearly separable. CRS_ELAPSED_TIME is also closely related to ACTUAL_ELAPSED_TIME and suffers from the same issue, namely tending to mark a flight as late simply because it was a long flight.
Operating Carrier_WN was selected because Southwest is in the interesting position of often leaving late but arriving on time. This could present an interesting problem for the decision boundary. The support vectors very closely match the distribution of the original data.
Origin_SFO was selected due to its weather problems, which are associated with late arrivals: if the airport has bad weather, arriving flights will sometimes be delayed. The separation of the support vectors is nearly identical to the original distribution of the data, so this variable looks like it generalizes well.
CARRIER_DELAY was chosen because it is the most frequent reason for a flight to arrive late. This is interesting because this delay is entirely in the hands of the operating airline, unlike something like a weather delay. The support vectors nearly match the distribution of the data, except with less separation than the original. This is a tough measure, because these delays are often very short and don't always result in an arrival delay.
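The "less separation among support vectors" pattern described above can also be checked numerically: support vectors sit near the decision surface, so their class-wise means should be closer together than the class means of the full sample. A toy sketch on synthetic one-dimensional data (not the flight dataset):

```python
import numpy as np
from sklearn.svm import SVC

# two overlapping Gaussian classes in one dimension
rng = np.random.default_rng(1)
X_toy = np.concatenate([rng.normal(-1, 1, 200),
                        rng.normal(1, 1, 200)]).reshape(-1, 1)
y_toy = np.array([0] * 200 + [1] * 200)

svc = SVC(kernel='linear', C=1.0).fit(X_toy, y_toy)
sv_mask = np.zeros(len(X_toy), dtype=bool)
sv_mask[svc.support_] = True

# compare class means over all points vs. over support vectors only;
# the support-vector means land closer to the boundary
for cls in (0, 1):
    full_mean = X_toy[y_toy == cls].mean()
    sv_mean = X_toy[sv_mask & (y_toy == cls)].mean()
    print(cls, round(full_mean, 2), round(sv_mean, 2))
```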